Backtracking Spatial Pyramid Pooling (SPP)-based Image Classifier for Weakly Supervised Top-down Salient Object Detection
Top-down saliency models produce a probability map that peaks at target
locations specified by a task/goal such as object detection. They are usually
trained in a fully supervised setting involving pixel-level annotations of
objects. We propose a weakly supervised top-down saliency framework using only
binary labels that indicate the presence/absence of an object in an image.
First, the probabilistic contribution of each image region to the confidence of
a CNN-based image classifier is computed through a backtracking strategy to
produce top-down saliency. From a set of saliency maps of an image produced by
fast bottom-up saliency approaches, we select the best saliency map suitable
for the top-down task. The selected bottom-up saliency map is combined with the
top-down saliency map. Features having high combined saliency are used to train
a linear SVM classifier to estimate feature saliency. This is integrated with
combined saliency and further refined through multi-scale superpixel averaging of the saliency map. We evaluate the performance of the proposed weakly supervised top-down saliency approach and achieve performance comparable to that of fully supervised approaches. Experiments are carried out on seven challenging datasets, and quantitative results are compared with 40 closely related approaches across 4 different applications.
Comment: 14 pages, 7 figures
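
A minimal sketch of the saliency-combination stage described above, in Python/NumPy. The correlation-based map selection, the element-wise product for combination, and all names are illustrative assumptions, not the authors' implementation:

import numpy as np

def select_bottom_up(bu_maps, td_map):
    # Pick the bottom-up map that best agrees with the top-down map
    # (correlation is a stand-in for the paper's selection criterion).
    scores = [np.corrcoef(m.ravel(), td_map.ravel())[0, 1] for m in bu_maps]
    return bu_maps[int(np.argmax(scores))]

def combine(bu_map, td_map):
    # Combine the selected bottom-up map with the top-down map;
    # an element-wise product is one simple choice.
    return bu_map * td_map

def superpixel_average(sal, segments_per_scale):
    # Refine a saliency map by averaging within superpixels at several
    # scales (one integer label map per scale), then averaging across scales.
    refined = np.zeros_like(sal)
    for segments in segments_per_scale:
        out = np.zeros_like(sal)
        for label in np.unique(segments):
            mask = segments == label
            out[mask] = sal[mask].mean()
        refined += out
    return refined / len(segments_per_scale)
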
DDAM-PS: Diligent Domain Adaptive Mixer for Person Search
Person search (PS) is a challenging computer vision problem where the
objective is to achieve joint optimization for pedestrian detection and
re-identification (ReID). Although previous advancements have shown promising
performance in the field in both fully and weakly supervised learning settings,
there exists a major gap in investigating the domain adaptation ability of PS
models. In this paper, we propose a diligent domain adaptive mixer (DDAM) for
person search (DDAM-PS) framework that aims to bridge this gap and improve
knowledge transfer from the labeled source domain to the unlabeled target
domain. Specifically, we introduce a novel DDAM module that generates moderate
mixed-domain representations by combining source and target domain
representations. The proposed DDAM module encourages domain mixing to minimize
the distance between the two extreme domains, thereby enhancing the ReID task.
To achieve this, we introduce two bridge losses and a disparity loss. The
objective of the two bridge losses is to guide the moderate mixed-domain
representations to maintain an appropriate distance from both the source and
target domain representations. The disparity loss aims to prevent the moderate
mixed-domain representations from being biased towards either the source or
target domains, thereby avoiding overfitting. Furthermore, we address the
conflict between the two subtasks, localization and ReID, during domain
adaptation. To handle this cross-task conflict, we forcefully decouple the
norm-aware embedding, which aids in better learning of the moderate
mixed-domain representation. We conduct experiments to validate the
effectiveness of our proposed method. Our approach demonstrates favorable
performance on the challenging PRW and CUHK-SYSU datasets. Our source code is publicly available at \url{https://github.com/mustansarfiaz/DDAM-PS}.
Comment: Accepted in WACV-2024.
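
A minimal PyTorch sketch of the mixed-domain objective sketched in this abstract. The convex mixing, the distance choice, and the margin are illustrative assumptions, not the released DDAM-PS code:

import torch
import torch.nn.functional as F

def ddam_losses(src_feat, tgt_feat, alpha=0.5, margin=0.1):
    # Moderate mixed-domain representation: a convex mix of the two domains.
    mix_feat = alpha * src_feat + (1 - alpha) * tgt_feat

    d_src = F.pairwise_distance(mix_feat, src_feat).mean()
    d_tgt = F.pairwise_distance(mix_feat, tgt_feat).mean()

    # Bridge losses: keep the mixed representation within an appropriate
    # distance of both the source and the target representations.
    bridge_src = F.relu(d_src - margin)
    bridge_tgt = F.relu(d_tgt - margin)

    # Disparity loss: penalize the mix drifting closer to one domain than
    # the other, which would bias it toward that domain.
    disparity = (d_src - d_tgt).abs()

    return bridge_src + bridge_tgt + disparity
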
SA2-Net: Scale-aware Attention Network for Microscopic Image Segmentation
Microscopic image segmentation is a challenging task, wherein the objective
is to assign semantic labels to each pixel in a given microscopic image. While
convolutional neural networks (CNNs) form the foundation of many existing
frameworks, they often struggle to explicitly capture long-range dependencies.
Although transformers were initially devised to address this issue using
self-attention, it has been shown that both local and global features are
crucial for addressing diverse challenges in microscopic images, including
variations in shape, size, appearance, and target region density. In this
paper, we introduce SA2-Net, an attention-guided method that leverages
multi-scale feature learning to effectively handle diverse structures within
microscopic images. Specifically, we propose a scale-aware attention (SA2) module
designed to capture inherent variations in scales and shapes of microscopic
regions, such as cells, for accurate segmentation. This module incorporates
local attention at each level of multi-stage features, as well as global
attention across multiple resolutions. Furthermore, we address the issue of
blurred region boundaries (e.g., cell boundaries) by introducing a novel
upsampling strategy called the Adaptive Up-Attention (AuA) module. This module
enhances the discriminative ability for improved localization of microscopic
regions using an explicit attention mechanism. Extensive experiments on five
challenging datasets demonstrate the benefits of our SA2-Net model. Our source
code is publicly available at \url{https://github.com/mustansarfiaz/SA2-Net}.
Comment: Accepted at BMVC 2023 as an oral presentation.
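
A minimal PyTorch sketch of the scale-aware attention idea: local (spatial) attention at each feature stage plus a global attention computed over descriptors pooled from all resolutions. Channel sizes and both attention forms are illustrative assumptions, not the released SA2-Net module:

import torch
import torch.nn as nn

class ScaleAwareAttention(nn.Module):
    def __init__(self, channels):  # channels: one channel count per stage
        super().__init__()
        # One lightweight local (spatial) attention per feature stage.
        self.local = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, 1, kernel_size=1), nn.Sigmoid())
            for c in channels
        )
        # Global attention over stage descriptors pooled across resolutions.
        self.global_fc = nn.Sequential(
            nn.Linear(sum(channels), len(channels)), nn.Softmax(dim=-1)
        )

    def forward(self, feats):  # feats: list of (B, C_i, H_i, W_i) tensors
        # Local attention: reweight each stage spatially.
        feats = [f * att(f) for f, att in zip(feats, self.local)]
        # Global attention: one scalar weight per stage, shared spatially.
        desc = torch.cat([f.mean(dim=(2, 3)) for f in feats], dim=1)
        w = self.global_fc(desc)  # (B, num_stages)
        return [f * w[:, i].view(-1, 1, 1, 1) for i, f in enumerate(feats)]
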
Handling Data Heterogeneity via Architectural Design for Federated Visual Recognition
Federated Learning (FL) is a promising research paradigm that enables the
collaborative training of machine learning models among various parties without
the need for sensitive information exchange. Nonetheless, retaining data in
individual clients introduces fundamental challenges to achieving performance
on par with centrally trained models. Our study provides an extensive review of
federated learning applied to visual recognition. It underscores the critical
role of thoughtful architectural design choices in achieving optimal
performance, a factor often neglected in the FL literature. Many existing FL
solutions are tested on shallow or simple networks, which may not accurately
reflect real-world applications. This practice restricts the transferability of
research findings to large-scale visual recognition models. Through an in-depth
analysis of diverse cutting-edge architectures such as convolutional neural
networks, transformers, and MLP-mixers, we experimentally demonstrate that
architectural choices can substantially enhance FL systems' performance,
particularly when handling heterogeneous data. We study 19 visual recognition
models from five different architectural families on four challenging FL
datasets. We also re-investigate the inferior performance of convolution-based
architectures in the FL setting and analyze the influence of normalization
layers on the FL performance. Our findings emphasize the importance of
architectural design for computer vision tasks in practical scenarios,
effectively narrowing the performance gap between federated and centralized
learning. Our source code is available at
https://github.com/sarapieri/fed_het.git.
Comment: To be published in NeurIPS 2023.
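
A minimal FedAvg sketch showing where the architecture choice plugs in: global_model can be a CNN, a transformer, or an MLP-mixer. The uniform weight averaging and the local_train placeholder are simplifying assumptions, not the paper's experimental setup:

import copy
import torch

def fedavg_round(global_model, client_loaders, local_train):
    client_states = []
    for loader in client_loaders:
        local_model = copy.deepcopy(global_model)  # start from global weights
        local_train(local_model, loader)           # a few local epochs
        client_states.append(local_model.state_dict())
    # Average client weights parameter-by-parameter (uniform weighting here;
    # weighting by client dataset size is the usual refinement). Note that
    # averaging normalization-layer statistics is a known pain point under
    # heterogeneous data, which is one reason architecture choice matters.
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in client_states]).mean(0)
    global_model.load_state_dict(avg)
    return global_model
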
PS-ARM: An End-to-End Attention-aware Relation Mixer Network for Person Search
Person search is a challenging problem with various real-world applications; it aims at joint person detection and re-identification of a query person from uncropped gallery images. Although previous studies focus on rich feature information learning, it is still hard to retrieve the query person due
to the occurrence of appearance deformations and background distractors. In
this paper, we propose a novel attention-aware relation mixer (ARM) module for
person search, which exploits the global relation between different local
regions within the RoI of a person and makes it robust against various appearance
deformations and occlusion. The proposed ARM is composed of a relation mixer
block and a spatio-channel attention layer. The relation mixer block introduces
a spatially attended spatial mixing and a channel-wise attended channel mixing
for effectively capturing discriminative relation features within an RoI. These
discriminative relation features are further enriched by introducing a
spatio-channel attention where the foreground and background discriminability
is empowered in a joint spatio-channel space. Our ARM module is generic and does not rely on fine-grained supervision or topological assumptions; hence, it can be easily integrated into any Faster R-CNN-based person search method.
Comprehensive experiments are performed on two challenging benchmark datasets:
CUHK-SYSU and PRW. Our PS-ARM achieves state-of-the-art performance on both datasets. On the challenging PRW dataset, our PS-ARM achieves an absolute gain of 5 points in mAP over SeqNet, while operating at a comparable speed.
Comment: Paper accepted at ACCV 2022.
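
A minimal PyTorch sketch of the relation-mixer idea on flattened RoI features: token (spatial) mixing and channel mixing, each gated by a learned attention. Layer shapes and the sigmoid gates are illustrative assumptions, not the released PS-ARM module:

import torch
import torch.nn as nn

class RelationMixerBlock(nn.Module):
    def __init__(self, num_tokens, channels):
        super().__init__()
        self.spatial_mix = nn.Linear(num_tokens, num_tokens)
        self.spatial_att = nn.Sequential(nn.Linear(num_tokens, num_tokens), nn.Sigmoid())
        self.channel_mix = nn.Linear(channels, channels)
        self.channel_att = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, x):  # x: (B, N, C) flattened RoI features
        # Spatially attended spatial mixing: mix across the N token positions.
        t = x.transpose(1, 2)  # (B, C, N)
        t = self.spatial_att(t) * self.spatial_mix(t)
        x = x + t.transpose(1, 2)
        # Channel-wise attended channel mixing.
        c = self.channel_att(x) * self.channel_mix(x)
        return x + c
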
Salient Mask-Guided Vision Transformer for Fine-Grained Classification
Fine-grained visual classification (FGVC) is a challenging computer vision
problem, where the task is to automatically recognise objects from subordinate
categories. One of its main difficulties is capturing the most discriminative
inter-class variances among visually similar classes. Recently, methods with
Vision Transformer (ViT) have demonstrated noticeable achievements in FGVC,
generally by employing the self-attention mechanism with additional
resource-consuming techniques to distinguish potentially discriminative regions
while disregarding the rest. However, such approaches may struggle to
effectively focus on truly discriminative regions because they rely only on the inherent self-attention mechanism, with the result that the classification token is likely to aggregate global information from less-important background patches.
Moreover, due to the scarcity of datapoints, classifiers may fail to find the most helpful inter-class distinguishing features, since other unrelated but distinctive background regions may be falsely recognised as being
valuable. To this end, we introduce a simple yet effective Salient Mask-Guided
Vision Transformer (SM-ViT), where the discriminability of the standard ViT's
attention maps is boosted through salient masking of potentially discriminative
foreground regions. Extensive experiments demonstrate that with the standard
training procedure our SM-ViT achieves state-of-the-art performance on popular
FGVC benchmarks among existing ViT-based approaches while requiring fewer
resources and lower input image resolution.
Comment: Accepted by VISAPP 2023 (Best Student Paper Award).
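
A minimal PyTorch sketch of salient masking inside ViT attention: attention logits toward patches flagged as salient foreground are boosted before the softmax, so the classification token aggregates more from the foreground. The additive boost rule is an illustrative assumption, not the SM-ViT formulation:

import torch
import torch.nn.functional as F

def masked_attention(q, k, v, salient, boost=1.0):
    # q, k, v: (B, heads, N, d); salient: (B, N) binary mask over patches
    # (1 = salient foreground, e.g. from an off-the-shelf saliency model).
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, heads, N, N)
    # Raise the logits of keys inside the salient mask so every query
    # (including the classification token) attends more to the foreground.
    logits = logits + boost * salient[:, None, None, :]
    return F.softmax(logits, dim=-1) @ v
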